This is the third notebook in the series. In it we address the suggestions we received during the presentation, along with some of the next steps outlined there.
There are three big things we want to do in this notebook. First, we adjust revenue and budget for the time value of money. Next, we perform feature selection using two different methods: a brute-force wrapper method built on random forests (using a library called Boruta) and feature selection using ANOVA. We run both on two data sets, the one where we drop all NAs and the imputed one, because these are the data sets we analyzed in the second notebook, where we saw interesting results. The discussion of ANOVA in the second notebook applies here as well. The reason we run two feature selection methods is to compare them: we want to see which features they recommend in common, where they differ, and finally which method performs better according to error metrics. Ideally we would also run multiple train/test splits, since we know the error rate depends heavily on the split and can swing by as much as 100 percentage points of MAPE in some cases. We recognize this may not be computationally feasible, as it already takes a significant amount of time to run a model once, let alone multiple times.
We already did an ANOVA feature selection in notebook two; the reason we repeat it here is to compare feature selection methods, as Joe was interested in which one would perform better. Additionally, the ANOVA in notebook two was run on the data set where the time value of money was not adjusted for, so we can also see the impact that this adjustment alone has on the results.
The first thing we must do, as always, is load all of the libraries that we will be using.
library(readr)
library(stringr)
library(plyr) # load plyr before dplyr/tidyverse so it does not mask dplyr's verbs
library(tidyverse)
library(dplyr)
library(mice)
library(VIM)
library(tidyr)
library(ggplot2)
library(sf)
library(sjmisc)
library(highcharter)
library(openair)
library(zoo)
library(countrycode)
library(ggmap)
library(blscrapeR) ##needed to get index for adjusting inflation
library(Boruta)
library(randomForest)
library(mlbench)
library(Metrics)
This is the CSV where all the NAs are still included; we will simply drop the NA rows.
Read in the data that we want to work with:
data_na = read.csv("allMerge_clean_withNA.csv")
head(data_na)
Do some basic class conversions
#converting classes
data_na <- mutate_if(data_na, is.factor, as.character)
data_na$budget <- as.numeric(data_na$budget)
data_na$Total.Revenue = as.numeric(data_na$Total.Revenue)
head(data_na)
Extract the year from the release_date column and store it in a new column called Year
data_na$Year = str_extract(data_na$release_date, "\\d{4}")
head(data_na)
Create a table that will give us the adjustment amount based on a base year of 2020
table = inflation_adjust(2020)
table
Create a data frame with the values we need from the table
table <- as.data.frame(table)
table$adj_value2 <- ((100 + table$pct_increase)/100)
df <- table[,c("year","adj_value2")]
colnames(df) = c("Year", "adj_value") #changing name for left_join
df
Merge the data frame from above with this data frame based on the year
data_na = left_join(df, data_na, by = 'Year')
data_na
Calculate adjusted_revenue and adjusted_budget and convert them to integers; we do this because the adjustments are very precise and would otherwise leave a lot of decimal places
data_na$adjusted_revenue = as.integer(data_na$Total.Revenue/data_na$adj_value)
NAs introduced by coercion to integer range
data_na$adjusted_budget = as.integer(data_na$budget/data_na$adj_value)
data_na
Drop all the columns that we will not be using
data_na = subset(data_na, select = -c(homepage, id, imdb_id, overview, poster_path, revenue, status, video, original_title, orginal_title_2, year_2, Year, adj_value, budget, Total.Revenue, title, X,production_countries, production_companies, tagline, spoken_languages, genres, cast, crew, belongs_to_collection, prod_comp_name, adult, original_language))
Also drop all rows with NA values, leaving us with roughly 5,000 data points
data_na <- drop_na(data_na)
Convert release_date to a Date and the logical and character variables to factors
data_na$release_date <- as.Date(data_na$release_date)
data_na <- data_na %>% mutate_if(is.logical,as.factor)
data_na <- data_na %>% mutate_if(is.character,as.factor)
head(data_na)
It is very easy to run the Boruta model, but it can be time consuming in many cases, so we cap maxRuns at 100, by which point almost all of the variables will have been classified as important or unimportant.
featureSelection_na <- Boruta(adjusted_revenue ~ ., data = data_na, doTrace = 2, maxRuns = 100)
After 13 iterations, +3 mins:
confirmed 30 attributes: action, adjusted_budget, adventure, animation, comedy and 25 more;
rejected 8 attributes: documentary, foreign, mgm, num_spoken_languages, tv_movie and 3 more;
still have 14 attributes left.
After 17 iterations, +3.8 mins:
confirmed 2 attributes: fantasy, romance;
rejected 3 attributes: mystery, new_line_cinema, rko_radio;
still have 9 attributes left.
After 21 iterations, +4.6 mins:
confirmed 1 attribute: warner_bros;
still have 8 attributes left.
After 24 iterations, +5.1 mins:
confirmed 1 attribute: production_country;
rejected 1 attribute: music;
still have 6 attributes left.
After 27 iterations, +5.6 mins:
rejected 1 attribute: history;
still have 5 attributes left.
After 42 iterations, +8.7 mins:
confirmed 1 attribute: universal;
still have 4 attributes left.
After 68 iterations, +14 mins:
rejected 1 attribute: columbia;
still have 3 attributes left.
We can plot the feature selection that Boruta returns to get more insight into the relevance of each variable
plot(featureSelection_na, las = 2, cex.axis = 0.5)
We get a lot of interesting initial results. adjusted_budget is by far the most relevant variable in predicting adjusted_revenue compared to almost all other variables. is_in_collection and vote_count are also very important variables in predicting revenue, not far behind adjusted_budget. Intuitively it makes a lot of sense that these variables should affect revenue: you would definitely expect budget and revenue to be positively correlated, you would expect vote count to go up for "good movies", and generally a sequel is made when the first movie does really well at the box office.
There are various variables (mainly the dummy variables created for production company and genre) that are deemed unimportant, like mgm, music, rko_radio, etc. Of all the unimportant variables, only one is not a dummy variable: num_spoken_languages.
There are also 3 variables that are tentative, but we can see where they fall on the plot above, so we can choose what to do with them.
Interestingly, the ranking of the variables is not the same when you run the selection multiple times. Depending on the seed the ranking changes slightly; adjusted_budget, is_in_collection and release_date were always in the top 3, but the order among the genres would sometimes differ. So we cannot definitively say that, for example, a movie being adventure rather than family has a bigger impact on revenue.
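Since the ranking shifts with the random seed, one small step (not in the original run) is to fix the seed before calling Boruta so a re-run reproduces the same decisions. The sketch below uses the built-in mtcars data so it runs standalone; the seed value is arbitrary:

```r
library(Boruta)

# Fixing the RNG seed makes a Boruta run (and its importance ranking)
# reproducible across re-runs. Illustrated on mtcars; the notebook's
# call on data_na would work the same way.
set.seed(42)                                        # arbitrary seed
sel_a <- Boruta(mpg ~ ., data = mtcars, maxRuns = 30)

set.seed(42)                                        # same seed, same decisions
sel_b <- Boruta(mpg ~ ., data = mtcars, maxRuns = 30)

identical(sel_a$finalDecision, sel_b$finalDecision)  # should be TRUE
```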
So we will fix the tentative variables (Boruta will assign each tentative variable to important or unimportant using the information it already has) and extract the formula that we will plug into the random forest model
featureSelectionFinal_na <- TentativeRoughFix(featureSelection_na)
getNonRejectedFormula(featureSelectionFinal_na)
adjusted_revenue ~ popularity + release_date + runtime + vote_average +
vote_count + meterScore + meterClass + is_in_collection +
has_tagline + number_of_cast + female_cast + male_cast +
unspecified_cast + number_of_crew + female_crew + male_crew +
unspecified_crew + comedy + horror + action + drama + fantasy +
thriller + animation + adventure + romance + family + twentieth_century +
warner_bros + universal + walt_disney + prod_size + num_production_companies +
production_country + adjusted_budget
<environment: 0x000002487937acb8>
featureSelection_na
Boruta performed 99 iterations in 19.76438 mins.
35 attributes confirmed important: action, adjusted_budget, adventure, animation, comedy
and 30 more;
14 attributes confirmed unimportant: columbia, documentary, foreign, history, mgm and 9
more;
3 tentative attributes left: crime, paramount, science_fiction;
Do a one way ANOVA of all the variables against adjusted_revenue
one.way_na <- aov(adjusted_revenue ~ ., data = data_na)
summary(one.way_na)
Df Sum Sq Mean Sq F value Pr(>F)
popularity 1 2.967e+19 2.967e+19 1760.028 < 2e-16 ***
release_date 1 4.130e+17 4.130e+17 24.498 7.71e-07 ***
runtime 1 1.061e+19 1.061e+19 629.217 < 2e-16 ***
vote_average 1 2.752e+18 2.752e+18 163.241 < 2e-16 ***
vote_count 1 8.270e+19 8.270e+19 4904.880 < 2e-16 ***
meterScore 1 6.038e+16 6.038e+16 3.582 0.058488 .
meterClass 2 2.772e+17 1.386e+17 8.220 0.000273 ***
is_in_collection 1 7.698e+18 7.698e+18 456.566 < 2e-16 ***
num_spoken_languages 1 7.541e+15 7.541e+15 0.447 0.503653
has_tagline 1 1.109e+17 1.109e+17 6.577 0.010361 *
number_of_cast 1 2.252e+16 2.252e+16 1.336 0.247882
female_cast 1 1.022e+17 1.022e+17 6.061 0.013854 *
male_cast 1 4.421e+17 4.421e+17 26.220 3.17e-07 ***
unspecified_cast 1 6.132e+15 6.132e+15 0.364 0.546494
number_of_crew 1 1.847e+17 1.847e+17 10.955 0.000941 ***
female_crew 1 7.784e+16 7.784e+16 4.617 0.031709 *
male_crew 1 2.262e+17 2.262e+17 13.417 0.000252 ***
comedy 1 7.453e+17 7.453e+17 44.203 3.31e-11 ***
horror 1 1.353e+18 1.353e+18 80.230 < 2e-16 ***
action 1 1.497e+16 1.497e+16 0.888 0.346096
drama 1 7.573e+17 7.573e+17 44.917 2.31e-11 ***
documentary 1 4.162e+15 4.162e+15 0.247 0.619336
science_fiction 1 3.033e+17 3.033e+17 17.990 2.26e-05 ***
crime 1 9.873e+17 9.873e+17 58.557 2.40e-14 ***
fantasy 1 4.826e+17 4.826e+17 28.626 9.21e-08 ***
thriller 1 1.904e+17 1.904e+17 11.294 0.000784 ***
animation 1 3.384e+18 3.384e+18 200.703 < 2e-16 ***
adventure 1 8.017e+17 8.017e+17 47.551 6.09e-12 ***
mystery 1 7.549e+13 7.549e+13 0.004 0.946652
war 1 3.601e+16 3.601e+16 2.136 0.143959
romance 1 1.593e+17 1.593e+17 9.448 0.002126 **
music 1 5.599e+16 5.599e+16 3.321 0.068468 .
family 1 7.977e+17 7.977e+17 47.314 6.87e-12 ***
western 1 1.253e+17 1.253e+17 7.434 0.006423 **
history 1 7.467e+16 7.467e+16 4.429 0.035387 *
tv_movie 1 5.466e+15 5.466e+15 0.324 0.569123
foreign 1 1.126e+16 1.126e+16 0.668 0.413839
paramount 1 1.942e+17 1.942e+17 11.516 0.000696 ***
mgm 1 3.142e+16 3.142e+16 1.863 0.172293
twentieth_century 1 1.572e+17 1.572e+17 9.327 0.002271 **
warner_bros 1 2.689e+15 2.689e+15 0.159 0.689659
universal 1 4.717e+17 4.717e+17 27.980 1.28e-07 ***
columbia 1 8.226e+16 8.226e+16 4.879 0.027230 *
rko_radio 1 4.835e+16 4.835e+16 2.868 0.090443 .
united_artists 1 1.069e+17 1.069e+17 6.342 0.011824 *
walt_disney 1 8.000e+17 8.000e+17 47.453 6.40e-12 ***
new_line_cinema 1 3.783e+16 3.783e+16 2.244 0.134213
prod_size 1 7.599e+16 7.599e+16 4.507 0.033810 *
num_production_companies 1 2.429e+17 2.429e+17 14.406 0.000149 ***
production_country 31 6.640e+17 2.142e+16 1.270 0.144673
adjusted_budget 1 9.795e+18 9.795e+18 580.978 < 2e-16 ***
Residuals 4558 7.685e+19 1.686e+16
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As we know from DMA, the cutoff p-value is 0.05: anything above that is not important and anything below it is. The stars beside each variable also indicate the level of significance, but they do not give a clear ranking of which variables are the most important, since the strongest predictors all just print as < 2e-16.
There are 6 variables with a p-value of < 2e-16, and three of them are vote_count, adjusted_budget and is_in_collection, so the top 3 are the same for both feature selection methods.
The number of features selected is roughly the same, but there are differences in which features are selected; for example, unspecified_cast is deemed unimportant by the ANOVA model but important according to the Boruta model.
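Since the p-values bottom out at < 2e-16, one way to get an actual ordering (not done above) is to rank the ANOVA terms by their F statistic instead. A sketch on mtcars so it runs standalone; in the notebook, one.way_na would take the place of fit:

```r
# Rank the sequential one-way ANOVA terms by F value, since p-values that
# all print as <2e-16 cannot order the strongest predictors.
fit <- aov(mpg ~ cyl + disp + hp + wt, data = mtcars)
tab <- as.data.frame(summary(fit)[[1]])
tab <- tab[trimws(rownames(tab)) != "Residuals", ]  # drop the residual row
tab[order(-tab$`F value`), c("F value", "Pr(>F)")]  # largest F first
```

Note that rownames from summary.aov are padded with spaces, hence the trimws().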
After running the models we can see which one generally performed better
As mentioned at the beginning of the notebook, we want to run the model multiple times because of the range in error across train/test splits. For the dropped-NA model we run it 30 times, as it did not take too long. The reason we use 200 trees was explained in the second notebook, where we plotted the random forest to see what the ideal number of trees would be.
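That tree-count check from notebook two can be sketched like this (shown on mtcars so it runs standalone; the notebook would use data_na):

```r
library(randomForest)

# Fit a forest with a generous tree count and plot OOB error against the
# number of trees; pick the point where the curve flattens.
set.seed(1)
rf_check <- randomForest(mpg ~ ., data = mtcars, ntree = 500)
plot(rf_check)  # error typically levels off well before 500 trees
```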
#We want to store the error metrics to analyze later
rmse_na_boruta <- c()
mape_na_boruta <- c()
for(i in 1:30){
# Train test split
num_samples = dim(data_na)[1]
sampling.rate = 0.8
training <- sample(1:num_samples, sampling.rate * num_samples, replace=FALSE)
trainingSet <- subset(data_na[training, ])
testing <- setdiff(1:num_samples,training)
testingSet <- subset(data_na[testing, ])
#Train the model
randomForestModel <- randomForest(adjusted_revenue ~ popularity + release_date + runtime + vote_average +
vote_count + meterScore + meterClass + is_in_collection +
has_tagline + number_of_cast + female_cast + male_cast +
unspecified_cast + number_of_crew + female_crew + male_crew +
unspecified_crew + comedy + horror + action + drama + fantasy +
thriller + animation + adventure + romance + family + twentieth_century +
warner_bros + universal + walt_disney + prod_size + num_production_companies +
production_country + adjusted_budget, data=trainingSet, ntree=200)
#Calculate the error
predictions <- predict(randomForestModel, testingSet)
error = predictions - testingSet$adjusted_revenue
mse = mean(error^2)
rmse_na_boruta[i] <- sqrt(mse)
errorpct <- ((abs(testingSet$adjusted_revenue - predictions))/testingSet$adjusted_revenue)
mape_na_boruta[i] <- mean(errorpct)
}
rmse_na_boruta
[1] 124737050 102558693 122492259 116916375 127019762 107007146 123109015 122740541 117728775 119541009
[11] 118853588 120273504 117118753 119396163 123954153 117978901 113296198 121125617 111351457 122997194
[21] 112044798 123644663 136100863 123790583 115179253 132892067 119202705 108628987 130820551 122639466
mape_na_boruta
[1] 34.20468 30.66354 152.84844 22.76226 96.28503 31.41667 105.11008 121.31701 88.21617 33.38971
[11] 27.86385 22.46998 29.27227 32.18365 107.26778 99.29217 100.83019 89.48064 143.62672 16.43567
[21] 29.07167 98.65965 104.70093 50.53272 14.62367 45.66011 49.17229 110.73941 38.78067 132.50135
We initially see that the RMSE is really large, in the $100M range, but we know this significantly exaggerates the error: the individual errors are already big, and with ~5K data points RMSE is not a good measure here. For this model we were able to run 30 train/test splits, and the range in MAPE is very large, from 14% to 152%, so we calculate the average to see the overall picture.
mean(rmse_na_boruta)
[1] 119838003
mean(mape_na_boruta)
[1] 68.64596
The average MAPE is 68%, which is not the best, but we already know we cannot predict box office revenue very accurately with our variables; what we can do is compare feature selection methods.
To compare this model against the models from the second R notebook, we take RMSE divided by the average adjusted_revenue.
mean(rmse_na_boruta)/mean(data_na$adjusted_revenue)
[1] 0.8486409
This number is significantly lower than the numbers we got in the second notebook, which shows that feature selection definitely improves accuracy. There is also the added benefit of requiring less computational power, since we are analyzing fewer features. We can also see how much RMSE exaggerates the error, by roughly 15 percentage points in this case.
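The comparison metric used here can be wrapped in a small helper (the function name is ours, not from the notebooks):

```r
# RMSE as a fraction of the mean of the actual values, so errors are
# comparable across differently scaled data sets.
rmse_over_mean <- function(actual, predicted) {
  sqrt(mean((predicted - actual)^2)) / mean(actual)
}

rmse_over_mean(c(100, 200, 300), c(110, 190, 320))  # sqrt(200)/200 = 0.0707...
```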
Similar to the model above we will be running it 30 times with 200 trees.
rmse_na_anova <- c()
mape_na_anova <- c()
for(i in 1:30){
num_samples = dim(data_na)[1]
sampling.rate = 0.8
training <- sample(1:num_samples, sampling.rate * num_samples, replace=FALSE)
trainingSet <- subset(data_na[training, ])
testing <- setdiff(1:num_samples,training)
testingSet <- subset(data_na[testing, ])
randomForestModel <- randomForest(adjusted_revenue ~ .- meterScore - num_spoken_languages - number_of_cast - unspecified_cast - action - documentary - mystery - war - music - tv_movie - foreign - mgm - warner_bros - rko_radio - new_line_cinema - production_country, data=trainingSet, ntree=200)
predictions <- predict(randomForestModel, testingSet)
error = predictions - testingSet$adjusted_revenue
mse = mean(error^2)
rmse_na_anova[i] <- sqrt(mse)
errorpct <- ((abs(testingSet$adjusted_revenue - predictions))/testingSet$adjusted_revenue)
mape_na_anova[i] <- mean(errorpct)
}
rmse_na_anova
[1] 140488870 102845115 117452615 101737333 114887395 125011860 137748987 127557539 115593173 123881524
[11] 117964319 108750962 102722889 124210814 101715401 111854977 124251609 119389020 122300600 101127190
[21] 127151025 118376500 134055430 136803209 127967431 133128920 108916743 120660171 113641974 115480796
mape_na_anova
[1] 98.42916 53.70034 51.86697 28.61978 44.88179 20.40744 42.70237 20.94986 42.46423 40.16076
[11] 18.14329 84.97976 111.69401 25.81500 32.95896 100.36770 23.03898 16.53286 69.27055 44.28644
[21] 76.80656 19.69833 52.00450 60.92622 23.09176 38.14424 134.95396 87.59915 37.24281 58.85535
We again see that the RMSE is really large, in the $100M range, but as before this exaggerates the error. Looking at the MAPE, its range is smaller than for the Boruta model, running from 16% to 111%.
mean(rmse_na_anova)
[1] 119255813
mean(mape_na_anova)
[1] 52.01977
As we can see, the features from ANOVA performed better. However, something we realized after running the models is that we cannot actually compare them fairly yet: the train/test splits differ between the two runs, so the difference could be due to the splits rather than the feature sets. We wanted to fix this, but the model needs to run overnight and we ran out of time before the submission deadline.
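For reference, holding the splits fixed across both feature sets could look like this (a sketch we did not have time to run; the names are ours):

```r
# Draw the 30 train/test index sets once, then reuse the same splits for
# both the Boruta and the ANOVA feature sets, so any difference in MAPE
# reflects the features rather than the split.
set.seed(2020)                         # arbitrary seed
num_samples <- nrow(data_na)
splits <- lapply(1:30, function(i)
  sample(1:num_samples, 0.8 * num_samples, replace = FALSE))

# inside each model's loop, replace the fresh sample() with:
#   training    <- splits[[i]]
#   trainingSet <- data_na[training, ]
#   testingSet  <- data_na[-training, ]
```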
To compare this model against the models from the second R notebook, we again take RMSE divided by the average adjusted_revenue.
mean(rmse_na_anova)/mean(data_na$adjusted_revenue)
[1] 0.8445181
This error metric is very comparable to the ANOVA model from notebook two, because it is computed on the same data set, which means we can isolate the impact that adjusting revenue and budget for time has on the model. Here we get RMSE as a percentage of revenue of 84%, whereas in notebook two we got 189%, so the adjustment for the time value of money definitely had a very positive impact on the accuracy of the model.
## Feature Selection on imputed data
Read in the file
data_rf = read.csv("rf_imputations_3.csv")
head(data_rf)
Do the same basic type conversions
#converting classes
data_rf <- mutate_if(data_rf, is.factor, as.character)
data_rf$budget <- as.numeric(data_rf$budget)
data_rf$Total.Revenue = as.numeric(data_rf$Total.Revenue)
head(data_rf)
Extract year from the release date column
data_rf$Year = str_extract(data_rf$release_date, "\\d{4}")
head(data_rf)
Find the adjustment value based on the base year of 2020 (this downloads the CPI data from the U.S. Bureau of Labor Statistics)
table = inflation_adjust(2020)
table
Extract the adj_value in a simple form and put it into a data frame so we can join it
table <- as.data.frame(table)
table$adj_value2 <- ((100 + table$pct_increase)/100)
df <- table[,c("year","adj_value2")]
colnames(df) = c("Year", "adj_value") #changing name for left_join
df
Join the main data with the adjustment and join by Year
data_rf = left_join(df, data_rf, by = 'Year')
data_rf
Calculate the adjusted Budget and Revenue, and convert the value to an integer
data_rf$adjusted_revenue = as.integer(data_rf$Total.Revenue/data_rf$adj_value)
NAs introduced by coercion to integer range
data_rf$adjusted_budget = as.integer(data_rf$budget/data_rf$adj_value)
data_rf
This introduced some NAs so we will remove them
data_rf <- drop_na(data_rf)
Drop the columns that we will not be needing anymore
data_rf <- subset(data_rf, select = -c(Year, adj_value, budget, Total.Revenue, title, X, original_language))
Change the variable types to factor and Date accordingly
data_rf$release_date <- as.Date(data_rf$release_date)
data_rf <- data_rf %>% mutate_if(is.logical,as.factor)
data_rf <- data_rf %>% mutate_if(is.character,as.factor)
head(data_rf)
It is very easy to run the Boruta model, but it can be time consuming in many cases, so here we cap maxRuns at 30, by which point almost all of the variables should be classified as important or unimportant.
featureSelection_rf <- Boruta(adjusted_revenue ~ ., data = data_rf, doTrace = 2, maxRuns = 30)
Computing permutation importance.. Progress: 96%. Estimated remaining time: 4 seconds.
After 13 iterations, +37 mins:
confirmed 30 attributes: action, adjusted_budget, adventure, animation, comedy and 25 more;
rejected 11 attributes: foreign, history, mgm, mystery, num_spoken_languages and 6 more;
still have 11 attributes left.
After 17 iterations, +46 mins:
rejected 2 attributes: paramount, rko_radio;
still have 9 attributes left.
After 24 iterations, +59 mins:
confirmed 1 attribute: warner_bros;
rejected 1 attribute: new_line_cinema;
still have 7 attributes left.
After 27 iterations, +1.1 hours:
confirmed 1 attribute: unspecified_cast;
still have 6 attributes left.
After 59 iterations, which took around 2.5 hours, Boruta was able to classify all but 3 features, which is a good result.
Plot the feature selection to get more information about it
plot(featureSelection_rf, las = 2, cex.axis = 0.5)
The majority of the features are classified the same way for both datasets, so the imputed dataset does not significantly change the features used.
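That overlap can also be checked programmatically rather than by eye. A minimal sketch, where the two vectors are made-up placeholders standing in for the output of Boruta's `getSelectedAttributes()` on each dataset (on the real objects the calls would be `getSelectedAttributes(featureSelection_rf)` and its NA-dataset counterpart):

```r
# Placeholder vectors (assumed) standing in for getSelectedAttributes() output
confirmed_na <- c("adjusted_budget", "vote_count", "popularity", "meterScore")
confirmed_rf <- c("adjusted_budget", "vote_count", "popularity", "walt_disney")

shared  <- intersect(confirmed_na, confirmed_rf) # confirmed on both datasets
only_na <- setdiff(confirmed_na, confirmed_rf)   # confirmed only on the NA data
only_rf <- setdiff(confirmed_rf, confirmed_na)   # confirmed only on imputed data
```

`shared` gives the features both selections agree on, while the two `setdiff` results isolate exactly where imputation changed the selection.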
Apply TentativeRoughFix to resolve the tentative features (assigning each to either important or unimportant) and get the formula
featureSelectionFinal_rf <- TentativeRoughFix(featureSelection_rf)
getNonRejectedFormula(featureSelectionFinal_rf)
adjusted_revenue ~ popularity + release_date + runtime + vote_average +
vote_count + meterScore + is_in_collection + has_tagline +
number_of_cast + female_cast + male_cast + unspecified_cast +
number_of_crew + female_crew + male_crew + unspecified_crew +
comedy + horror + action + drama + documentary + crime +
thriller + animation + adventure + romance + family + twentieth_century +
warner_bros + columbia + walt_disney + new_line_cinema +
prod_size + num_production_companies + production_country +
adjusted_budget
<environment: 0x000002487e8d5108>
Just take a look at the feature selection overall again
featureSelection_rf
Boruta performed 59 iterations in 2.574235 hours.
34 attributes confirmed important: action, adjusted_budget, adventure, animation,
columbia and 29 more;
15 attributes confirmed unimportant: fantasy, foreign, history, mgm, music and 10
more;
3 tentative attributes left: crime, meterClass, new_line_cinema;
Do a one-way ANOVA of all the variables against adjusted_revenue
one.way_rf<- aov(adjusted_revenue ~ ., data = data_rf)
summary(one.way_rf)
Df Sum Sq Mean Sq F value Pr(>F)
popularity 1 6.614e+19 6.614e+19 11954.568 < 2e-16 ***
release_date 1 1.644e+18 1.644e+18 297.183 < 2e-16 ***
runtime 1 3.093e+18 3.093e+18 559.137 < 2e-16 ***
vote_average 1 5.643e+17 5.643e+17 102.000 < 2e-16 ***
vote_count 1 1.161e+20 1.161e+20 20985.186 < 2e-16 ***
meterScore 1 2.110e+16 2.110e+16 3.815 0.050819 .
meterClass 2 2.879e+16 1.439e+16 2.602 0.074165 .
is_in_collection 1 5.389e+18 5.389e+18 974.151 < 2e-16 ***
num_spoken_languages 1 2.173e+16 2.173e+16 3.928 0.047504 *
has_tagline 1 1.765e+17 1.765e+17 31.910 1.64e-08 ***
number_of_cast 1 2.001e+17 2.001e+17 36.164 1.85e-09 ***
female_cast 1 1.424e+17 1.424e+17 25.738 3.95e-07 ***
male_cast 1 1.286e+18 1.286e+18 232.484 < 2e-16 ***
unspecified_cast 1 1.008e+16 1.008e+16 1.822 0.177082
number_of_crew 1 1.719e+17 1.719e+17 31.080 2.51e-08 ***
female_crew 1 2.943e+16 2.943e+16 5.319 0.021101 *
male_crew 1 7.688e+17 7.688e+17 138.970 < 2e-16 ***
unspecified_crew 1 1.399e+14 1.399e+14 0.025 0.873675
comedy 1 2.397e+17 2.397e+17 43.330 4.74e-11 ***
horror 1 8.514e+17 8.514e+17 153.895 < 2e-16 ***
action 1 1.695e+15 1.695e+15 0.306 0.579967
drama 1 5.098e+17 5.098e+17 92.152 < 2e-16 ***
documentary 1 1.829e+15 1.829e+15 0.331 0.565298
science_fiction 1 1.829e+17 1.829e+17 33.059 9.07e-09 ***
crime 1 6.195e+17 6.195e+17 111.971 < 2e-16 ***
fantasy 1 5.125e+17 5.125e+17 92.646 < 2e-16 ***
thriller 1 1.751e+17 1.751e+17 31.652 1.87e-08 ***
animation 1 6.767e+17 6.767e+17 122.314 < 2e-16 ***
adventure 1 1.229e+18 1.229e+18 222.110 < 2e-16 ***
mystery 1 1.158e+15 1.158e+15 0.209 0.647251
war 1 4.837e+15 4.837e+15 0.874 0.349786
romance 1 8.649e+16 8.649e+16 15.633 7.72e-05 ***
music 1 3.395e+16 3.395e+16 6.138 0.013242 *
family 1 9.038e+17 9.038e+17 163.377 < 2e-16 ***
western 1 7.320e+16 7.320e+16 13.232 0.000276 ***
history 1 4.236e+15 4.236e+15 0.766 0.381575
tv_movie 1 4.181e+15 4.181e+15 0.756 0.384668
foreign 1 5.709e+15 5.709e+15 1.032 0.309729
paramount 1 5.696e+17 5.696e+17 102.958 < 2e-16 ***
mgm 1 1.343e+16 1.343e+16 2.428 0.119213
twentieth_century 1 3.535e+17 3.535e+17 63.894 1.38e-15 ***
warner_bros 1 1.198e+17 1.198e+17 21.656 3.28e-06 ***
universal 1 7.951e+17 7.951e+17 143.726 < 2e-16 ***
columbia 1 3.248e+17 3.248e+17 58.719 1.90e-14 ***
rko_radio 1 4.041e+16 4.041e+16 7.305 0.006883 **
united_artists 1 1.399e+17 1.399e+17 25.286 4.99e-07 ***
walt_disney 1 1.949e+18 1.949e+18 352.375 < 2e-16 ***
new_line_cinema 1 4.907e+16 4.907e+16 8.869 0.002904 **
prod_size 1 5.327e+16 5.327e+16 9.630 0.001917 **
num_production_companies 1 2.998e+17 2.998e+17 54.196 1.89e-13 ***
production_country 31 6.160e+17 1.987e+16 3.592 5.93e-11 ***
adjusted_budget 1 6.591e+18 6.591e+18 1191.375 < 2e-16 ***
Residuals 19637 1.086e+20 5.532e+15
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Once again, the majority of the features' classifications match the ANOVA on the dataset where all the NAs were dropped. There is one interesting difference: meterScore was very significant in the NA dataset, but it is not important on the imputed one. This could mean that the imputations for this variable are not very accurate, or that meterScore had a big impact on the movies in the first dataset but much less of an impact once the other ~15K movies were included.
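The list of variables to exclude can also be built programmatically from the `aov` output instead of by hand. A minimal sketch on a small synthetic data frame (the variable names here are made up; on the real data the fit would be `one.way_rf`), pulling the `Pr(>F)` column out of `summary()` and keeping terms below 0.05:

```r
set.seed(7)
toy   <- data.frame(a = rnorm(100), b = rnorm(100))
toy$y <- 2 * toy$a + rnorm(100)          # `a` drives y, `b` is pure noise

fit <- aov(y ~ a + b, data = toy)
tab <- summary(fit)[[1]]                 # the printed ANOVA table as a data frame
pvals <- tab[["Pr(>F)"]]
terms <- trimws(rownames(tab))           # row names carry trailing spaces
significant <- terms[!is.na(pvals) & pvals < 0.05]
```

The `!is.na(pvals)` guard drops the Residuals row, which has no F test; the surviving names can then be pasted into a model formula.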
As mentioned at the beginning of the notebook, we wanted to run the model multiple times because of the range in error across train/test splits. For the NA model we will be running it 3 times, as each run took a long time. The reason we are using 200 trees was explained in the second notebook, where we plotted the random forest to see what the ideal number of trees would be.
rmse_rf_boruta <- c()
mape_rf_boruta <- c()
for(i in 1:3){
  # Train/test split
  num_samples = dim(data_rf)[1]
  sampling.rate = 0.8
  training <- sample(1:num_samples, sampling.rate * num_samples, replace=FALSE)
  trainingSet <- subset(data_rf[training, ])
  testing <- setdiff(1:num_samples, training)
  testingSet <- subset(data_rf[testing, ])
  # Train the model
  randomForestModel <- randomForest(adjusted_revenue ~ popularity + release_date +
    runtime + vote_average + vote_count + meterScore + meterClass +
    is_in_collection + has_tagline + number_of_cast + female_cast +
    male_cast + unspecified_cast + number_of_crew + female_crew +
    male_crew + unspecified_crew + comedy + horror + action +
    drama + documentary + thriller + animation + adventure +
    romance + family + twentieth_century + warner_bros + columbia +
    walt_disney + prod_size + num_production_companies + production_country +
    adjusted_budget, data=trainingSet, ntree=200)
  # Calculate the error
  predictions <- predict(randomForestModel, testingSet)
  error = predictions - testingSet$adjusted_revenue
  mse = mean(error^2)
  rmse_rf_boruta[i] <- sqrt(mse)
  errorpct <- abs(testingSet$adjusted_revenue - predictions) / testingSet$adjusted_revenue
  mape_rf_boruta[i] <- mean(errorpct)
}
rmse_rf_boruta
[1] 69699126 61177562 59283379
mape_rf_boruta
[1] 132.6351 106.4524 188.4149
We initially see that the RMSE is very large, in the $60M range, but we know that RMSE exaggerates already-large errors, and with ~20K data points it is not a good measure of error on its own. We were only able to run this model 3 times because each run takes a very long time, so the error is hard to take at face value. The interesting part is that the RMSE decreased significantly while the MAPE increased significantly. This is likely because the NAs we dropped in the first dataset left behind movies with higher revenue; bigger movies (more box office) are generally better documented. This can be confirmed by taking the average of both revenue columns.
mean(data_na$adjusted_revenue)
[1] 141211676
mean(data_rf$adjusted_revenue)
[1] 42918089
The hypothesis is confirmed: the imputed data has 4 times the data points, but the average revenue decreases significantly.
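This also explains mechanically why MAPE rises when average revenue falls: the same absolute error is a far larger percentage of a small revenue. A toy illustration with made-up numbers:

```r
actual <- c(200e6, 2e6)   # a blockbuster and a small film (assumed values)
pred   <- actual + 10e6   # the same $10M absolute error on each

ape  <- abs(actual - pred) / actual  # per-movie absolute percentage error
mape <- mean(ape)                    # 0.05 for the big film, 5 for the small one
```

The small film alone drags the mean to ~2.5 (250%), so a dataset dominated by low-revenue movies will show a much worse MAPE even if absolute errors are unchanged.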
mean(rmse_rf_boruta)
[1] 63386689
mean(mape_rf_boruta)
[1] 142.5008
We took the average to compare to the NA values, and the MAPE is ~2.5x worse, which is not good. This could be because imputation introduces errors of its own, and predicting from imputed data compounds them. This leads us to the conclusion that imputing the data does not always improve error results, at least not with "simple" models like random forest.
To compare this model against the models from the second R notebook we will take RMSE / average adjusted_revenue
mean(rmse_rf_boruta)/mean(data_rf$adjusted_revenue)
[1] 1.476922
This error metric is lower than the results in the second notebook. While it is not significantly lower, it shows that, at the very least, feature selection has a positive impact on the error.
Similar to the model above we will be running it 3 times with 200 trees.
rmse_rf_anova <- c()
mape_rf_anova <- c()
for(i in 1:3){
  # Train/test split
  num_samples = dim(data_rf)[1]
  sampling.rate = 0.8
  training <- sample(1:num_samples, sampling.rate * num_samples, replace=FALSE)
  trainingSet <- subset(data_rf[training, ])
  testing <- setdiff(1:num_samples, training)
  testingSet <- subset(data_rf[testing, ])
  # Train the model
  randomForestModel <- randomForest(adjusted_revenue ~ . - meterClass - meterScore -
    unspecified_cast - unspecified_crew - action - documentary - mystery - war -
    history - tv_movie - foreign - mgm, data=trainingSet, ntree=200)
  # Calculate the error
  predictions <- predict(randomForestModel, testingSet)
  error = predictions - testingSet$adjusted_revenue
  mse = mean(error^2)
  rmse_rf_anova[i] <- sqrt(mse)
  errorpct <- abs(testingSet$adjusted_revenue - predictions) / testingSet$adjusted_revenue
  mape_rf_anova[i] <- mean(errorpct)
}
rmse_rf_anova
[1] 58030853 58763061 62084336
mape_rf_anova
[1] 112.7862 148.1727 139.2491
The RMSE and MAPE are lower for the ANOVA selection than for the Boruta selection, which again leads us to believe that in this case ANOVA selection may actually be better than Boruta selection. But once again we cannot say this for certain, because the two models were not run on the same train/test splits.
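One cheap fix for that: draw the split indices once and reuse them in both loops, so each feature set is scored on identical held-out rows. A minimal sketch of the idea (the `n` here is a placeholder; on the real data it would be `nrow(data_rf)`):

```r
set.seed(123)  # fix the RNG so the draws are reproducible
n <- 1000      # placeholder for nrow(data_rf)

# One list of training indices, drawn up front
splits <- lapply(1:3, function(i) sample(n, 0.8 * n))

# In both model loops, use training <- splits[[i]] instead of a fresh sample();
# the test rows setdiff(1:n, splits[[i]]) are then identical across models.
```

With shared splits, any remaining MAPE difference between the Boruta and ANOVA feature sets reflects the features themselves, not luck of the draw.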
mean(rmse_rf_anova)
[1] 59626083
mean(mape_rf_anova)
[1] 133.4027
To compare this model against the models from the second R notebook we will take RMSE / average adjusted_revenue
mean(rmse_rf_anova)/mean(data_rf$adjusted_revenue)
[1] 1.3893
Once again this error metric is lower than the results in the second notebook
As we do not have a lot of computational power, we were not able to run a neural network model. It was already taking an entire night to run the random forest model, so neural net models would have taken much longer. We believe that if we are able to find the correct structure for a neural net model, then it would perform better on the imputed data than on the data where we drop the NAs: with 4 times the amount of training data we would be able to train the model better and decrease the MAPE.
Another thing that we were not able to do in this project is some sort of optimization where we determine the correct combination of variables to optimize revenue.
Another interesting problem we face is that the MAPE differs drastically depending on the train/test split; we would like to understand why this happens and how we can fix it.
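One way to start investigating, sketched here on synthetic data since re-running the random forest many times is too slow: repeat the split many times with a cheap stand-in model and look at the spread of MAPE rather than a single value. The skewed target below is made up to mimic revenue:

```r
set.seed(1)
df <- data.frame(x = rnorm(500))
df$adjusted_revenue <- exp(3 * df$x + rnorm(500))  # heavily skewed, like revenue

mape_one_split <- function(df, frac = 0.8) {
  idx  <- sample(nrow(df), frac * nrow(df))
  fit  <- lm(adjusted_revenue ~ x, data = df[idx, ])  # cheap stand-in model
  pred <- predict(fit, newdata = df[-idx, ])
  mean(abs(df$adjusted_revenue[-idx] - pred) / df$adjusted_revenue[-idx])
}

mapes <- replicate(20, mape_one_split(df))
range(mapes)   # the width of this range quantifies split-dependence
```

With a skewed target, whichever split happens to place a few extreme-revenue rows in the test set dominates the MAPE, which is consistent with the wide ranges we observed above.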